Statistics Without Borders 2024
This presentation provides a review of findings from relevant analytical process documents from Impact Initiatives, best practices, and recommendations for improvement. The documents reviewed in this presentation are:
Guidelines_2-3_Quant Data Analysis_Annex3
Github: cleaningtools
Github: analysistools
Github: presentresults
TBD
The Impact Initiatives team has faced challenges with the pain points below. This report aims to address these issues and to provide general best practices.
Handle missing data.
Correct size bias in the assessment.
Provide calibration techniques to help with better estimates.
This report uses the 2023 Ukraine | REACH Health Sector Need Assessment dataset to illustrate examples of recommendations shared in this presentation.
[1. “Sampling Frame” Section]
[2. “Sample” Section]
[3. “Extended DAP” Section]
About
The “Sampling Frame” section includes information on cluster names, strata variables, and population data. It focuses on different localities, administrative levels, and household numbers.
Best practices observed:
Use of stratified sampling with multiple strata levels (e.g., locality, admin levels).
Clear listing of variables and population numbers for different areas.
Stratified sampling: In the UKR2217a dataset, the variable ‘oblast’ covers four areas: Vinnytska, Dnipropetrovska, and Zaporizka oblasts and Kharkiv city. It is used here to demonstrate how stratification works and to confirm that these areas are included in the sample. Forced migration and displacement in high-risk conflict zones will change the demographics of these areas, so the frame needs to be regularly updated to reflect these changes.
| cluster name variable | first strata name variable | population |
|---|---|---|
| oblast | Admin1 | HH_number |
| UA12 | Dnipropetrovska | [population number] |
| UA63 | Kharkivska | [population number] |
| UA05 | Vinnytska | [population number] |
| UA23 | Zaporizka | [population number] |
Ensure the frame is regularly updated to reflect demographic changes. If there’s data indicating new settlements or demographic shifts (like a ‘date of settlement’ column), it could be used to show the need for regular updates in the sampling frame.
Population units: Reflect the current population structure post-conflict.
Cluster names: Ensure they match the administrative divisions in the dataset.
Validate the frame with local sources or GIS data for accuracy, especially if:
The geographic scope of the assessment includes areas with dynamic population distributions or recent changes due to migration, natural disasters, or other factors.
There is a need to ensure that all sub-populations within the geographic coverage are represented accurately, which is critical in humanitarian contexts such as Ukraine.
About
The “Sample” section almost mirrors the “Sampling Frame” section with a direct connection between the frame and the actual sample drawn.
Best practices observed:
Consistency in the use of strata and cluster names between the sampling frame and the sample.
Representation of diverse localities in the sample.
Consistency in strata and cluster names: The regional or demographic categories in the sampling frame match those in the actual sample data.
| cluster name variable | first strata name variable | RDS Used | CHW Verified | number of surveys |
|---|---|---|---|---|
| locality | Admin1 | (Y/N) | (Y/N) | |
| city_1 | UA12 | Y | N | 16 |
| city_2 | UA63 | N | Y | 6 |
| city_3 | UA05 | Y | Y | 17 |
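The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration and not part of any Impact Initiatives tool: the function name `compare_strata` is hypothetical, and the inputs are simply the UA codes that appear in the sampling frame table and in the sample table of this report.

```python
# UA codes copied from the sampling frame table and the sample table in this report.
frame_strata = {"UA12", "UA63", "UA05", "UA23"}
sample_strata = {"UA12", "UA63", "UA05"}


def compare_strata(frame, sample):
    """Return codes that appear on only one side (hypothetical helper)."""
    return {
        "in_frame_not_sampled": sorted(frame - sample),
        "in_sample_not_in_frame": sorted(sample - frame),
    }


result = compare_strata(frame_strata, sample_strata)
print(result)
```

A code that is in the frame but has no surveys may be legitimate (for example, an inaccessible area), but it is worth documenting. A code that is in the sample but not in the frame usually points to a naming mismatch that should be fixed before weighting.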
Other sampling techniques: If the dataset has variables indicating transient populations, we could account for respondent-driven sampling (RDS).
Cross-verification with community workers: If there are variables indicating community health worker interactions, these could be used to cross-check the representativeness of the sample.
Number of surveys: Record planned versus actual surveys conducted in each stratum and adjust for the difference.
Selected units: Document any inaccessible areas and remove them from the sample.
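One way to track planned versus actual surveys per stratum is sketched below. The actual counts are taken from the sample table in this report; the planned targets are assumed values chosen only for illustration, and `completion_report` is a hypothetical helper name.

```python
# Actual counts come from the sample table; planned targets are ASSUMED for illustration.
actual = {"UA12": 16, "UA63": 6, "UA05": 17}
planned = {"UA12": 16, "UA63": 10, "UA05": 15}


def completion_report(planned, actual):
    """Summarise planned vs. actual surveys for each stratum."""
    report = {}
    for stratum, target in planned.items():
        done = actual.get(stratum, 0)  # strata with no surveys count as zero
        report[stratum] = {
            "planned": target,
            "actual": done,
            "shortfall": max(target - done, 0),
            "completion_rate": done / target if target else None,
        }
    return report


report = completion_report(planned, actual)
for stratum, row in report.items():
    print(stratum, row)
```

Strata with a shortfall are candidates for follow-up data collection or for a documented weighting adjustment.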
About
The “Extended DAP” (Data Analysis Plan) section includes research questions, hypotheses, indicators, and variables. It guides the analytical approach for specific research questions.
Best practices observed:
Structured approach with clear hypotheses, indicators, and variables.
Focus on relevant research questions with a defined hypothesis type.
Structured approach to data analysis: The dataset has different health indicators, which can be used to demonstrate structured analysis around hypotheses like “Access to clean water is less than 50% in rural areas”.
Primary research question: What is the impact of proximity to conflict zones on health service availability?
Sub-research questions: How does the damage to infrastructure correlate with disruptions in medical services? What are the critical staffing shortages affecting healthcare delivery?
Type of analysis: Define analyses planned versus those conducted.
Hypotheses: Link each analysis to the hypotheses it’s intended to test.
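To make the link between a hypothesis and its analysis concrete, the example hypothesis "Access to clean water is less than 50% in rural areas" could be tested with a one-sided test of a proportion. The sketch below uses a simple normal approximation and hypothetical counts (84 of 200 households); it ignores survey weights and clustering, which a real analysis of the Ukraine dataset would need to account for.

```python
import math


def one_sided_proportion_test(successes, n, p0):
    """Test H0: p >= p0 against H1: p < p0 using a normal approximation."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # lower-tail probability
    return p_hat, z, p_value


# HYPOTHETICAL counts: 84 of 200 rural households report access to clean water.
p_hat, z, p_value = one_sided_proportion_test(84, 200, 0.5)
print(f"p_hat={p_hat:.2f}, z={z:.2f}, p_value={p_value:.4f}")
```

A small p-value would support the hypothesis that access is below 50%. Recording the test, the threshold, and the variable used alongside each hypothesis in the DAP keeps planned and conducted analyses comparable.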
About
The cleaningtools package on GitHub provides a comprehensive set of tools for data cleaning and is structured into three main components.
Check: This component includes functions to flag potential issues with data such as personally identifiable information (check_pii), audit durations (create_audit_list, add_duration_from_audit), and outliers (check_outliers).
Best practices observed:
PII Scrubbing: The check_pii function is a good practice to ensure data privacy by identifying potential personal information.
Audit Trails: By checking audit file durations, this tool ensures data quality by confirming reasonable response times.
Outlier Detection: The check_outliers function is critical for identifying and handling anomalies in the dataset which can skew results.
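For readers unfamiliar with outlier flagging, the sketch below shows one generic rule (Tukey's interquartile-range fences) applied to hypothetical interview durations in minutes. It is not the method used by check_outliers, whose internals are not described in this report; `iqr_outliers` is a hypothetical helper written only to illustrate the idea.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (generic Tukey rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# HYPOTHETICAL interview durations in minutes.
durations_min = [22, 25, 27, 28, 30, 31, 33, 35, 4, 120]
flagged = iqr_outliers(durations_min)
print(flagged)
```

Flagged values are candidates for review and are not automatically errors. A very short interview may indicate a skipped questionnaire, while a very long one may simply reflect an interruption.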
Recommendations for improvement:
Missing data: Implement methods for handling missing data, possibly through imputation techniques or weighted adjustments.
Size bias: Provide tools for correcting size bias, such as scaling factors based on sample versus population size.
Calibration techniques: Include functions for calibration to adjust survey weights based on known population characteristics, which can help in producing better estimates.
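The size-bias and calibration recommendations can be illustrated with two small functions. This is a sketch under stated assumptions, not an existing cleaningtools or analysistools function: `design_weights` and `poststratify` are hypothetical names, the household populations and the urban/rural benchmark totals are invented for illustration, and only the survey counts (16, 6, 17) come from the sample table in this report.

```python
def design_weights(population, sample_sizes):
    """Scaling factor per stratum: households represented by each survey."""
    return {s: population[s] / sample_sizes[s] for s in sample_sizes}


def poststratify(records, weight_key, group_key, known_totals):
    """Rescale weights so the weighted total in each group matches known_totals."""
    sums = {}
    for r in records:
        sums[r[group_key]] = sums.get(r[group_key], 0.0) + r[weight_key]
    return [
        {**r, "calibrated_weight": r[weight_key] * known_totals[r[group_key]] / sums[r[group_key]]}
        for r in records
    ]


# Survey counts are from the sample table; populations are HYPOTHETICAL.
sample_sizes = {"UA12": 16, "UA63": 6, "UA05": 17}
population = {"UA12": 3200, "UA63": 1800, "UA05": 1700}
weights = design_weights(population, sample_sizes)

# HYPOTHETICAL respondents and HYPOTHETICAL urban/rural benchmark totals.
records = [
    {"stratum": "UA12", "setting": "urban", "weight": weights["UA12"]},
    {"stratum": "UA12", "setting": "rural", "weight": weights["UA12"]},
    {"stratum": "UA63", "setting": "urban", "weight": weights["UA63"]},
    {"stratum": "UA05", "setting": "rural", "weight": weights["UA05"]},
]
calibrated = poststratify(records, "weight", "setting", {"urban": 600, "rural": 240})
for row in calibrated:
    print(row)
```

Post-stratification to a single margin is the simplest form of calibration. When several benchmarks must be matched at once (for example, oblast and urban/rural), an iterative method such as raking would be the usual next step.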
To apply these recommendations to the Ukraine dataset, we would:
Assess the extent and pattern of missing data.
Apply correctional weights if size bias is detected in the sample.
Use calibration techniques to align survey results with known benchmarks, improving the accuracy of estimates.
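The first step above, assessing the extent and pattern of missing data, could look like the following sketch. The rows and column names (`hh_size`, `distance_to_clinic_km`) are hypothetical and are not taken from the Ukraine dataset; missing values are represented as `None`.

```python
from collections import Counter


def missing_share(rows, columns):
    """Extent: share of rows where each column is missing."""
    n = len(rows)
    return {c: sum(r.get(c) is None for r in rows) / n for c in columns}


def missing_patterns(rows, columns):
    """Pattern: how often each combination of missing columns occurs."""
    return Counter(tuple(c for c in columns if r.get(c) is None) for r in rows)


# HYPOTHETICAL survey rows; column names are illustrative only.
columns = ["oblast", "hh_size", "distance_to_clinic_km"]
rows = [
    {"oblast": "UA12", "hh_size": 4, "distance_to_clinic_km": 2.5},
    {"oblast": "UA12", "hh_size": None, "distance_to_clinic_km": None},
    {"oblast": "UA63", "hh_size": 3, "distance_to_clinic_km": None},
    {"oblast": "UA05", "hh_size": 5, "distance_to_clinic_km": 1.0},
]
shares = missing_share(rows, columns)
patterns = missing_patterns(rows, columns)
print(shares)
print(patterns)
```

Columns that tend to be missing together, or missingness concentrated in particular oblasts, suggest that the data are not missing completely at random, which affects whether imputation or a weighting adjustment is the more suitable remedy.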
Published on January 4, 2024